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Abstract 

Imitation learning has been commonly applied to solve different tasks in isolation. 
This usually requires either careful feature engineering, or a significant number of 
samples. This is far from what we desire: ideally, robots should be able to learn 
from very few demonstrations of any given task, and instantly generalize to new 
situations of the same task, without requiring task-specific engineering. In this 
paper, we propose a meta-learning framework for achieving such capability, which 
we call one-shot imitation learning. 

Specifically, we consider the setting where there is a very large (maybe infinite) 
set of tasks, and each task has many instantiations. For example, a task could be 
to stack all blocks on a table into a single tower, another task could be to place 
all blocks on a table into two-block towers, etc. In each case, different instances 
of the task would consist of different sets of blocks with different initial states. 
At training time, our algorithm is presented with pairs of demonstrations for a 
subset of all tasks. A neural net is trained such that when it takes as input the first 
demonstration demonstration and a state sampled from the second demonstration, 
it should predict the action corresponding to the sampled state. At test time, a full 
demonstration of a single instance of a new task is presented, and the neural net 
is expected to perform well on new instances of this new task. Our experiments 
show that the use of soft attention allows the model to generalize to conditions and 
tasks unseen in the training data. We anticipate that by training this model on a 
much greater variety of tasks and settings, we will obtain a general system that can 
turn any demonstrations into robust policies that can accomplish an overwhelming 
variety of tasks. 


1 Introduction 

We are interested in robotic systems that are able to perform a variety of complex useful tasks, e.g. 
tidying up a home or preparing a meal. The robot should be able to learn new tasks without long 
system interaction time. To accomplish this, we must solve two broad problems. The first problem is 
that of dexterity: robots should learn how to approach, grasp and pick up complex objects, and how 
to place or arrange them into a desired configuration. The second problem is that of communication: 
how to communicate the intent of the task at hand, so that the robot can replicate it in a broader set of 
initial conditions. 

Demonstrations are an extremely convenient form of information we can use to teach robots to over¬ 
come these two challenges. Using demonstrations, we can unambiguously communicate essentially 
any manipulation task, and simultaneously provide clues about the specific motor skills required to 
perform the task. We can compare this with an alternative form of communication, namely natural 
language. Although language is highly versatile, effective, and efficient, natural language processing 
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Figure 1: (a) Traditionally, policies are task-specific. For example, a policy might have been 

trained through an imitation learning algorithm to stack blocks into towers of height 3, and then 
another policy would be trained to stack blocks into towers of height 2, etc. (b) In this paper, we 
are interested in training networks that are not specific to one task, but rather can be told (through a 
single demonstration) what the current new task is, and be successful at this new task. For example, 
when it is conditioned on a single demonstration for task F, it should behave like a good policy for 
task F. (c) We can phrase this as a supervised learning problem, where we train this network on a set 
of training tasks, and with enough examples it should generalize to unseen, but related tasks. To train 
this network, in each iteration we sample a demonstration from one of the training tasks, and feed it 
to the network. Then, we sample another pair of observation and action from a second demonstration 
of the same task. When conditioned on both the first demonstration and this observation, the network 
is trained to output the corresponding action. 


systems are not yet at a level where we could easily use language to precisely describe a complex 
task to a robot. Compared to language, using demonstrations has two fundamental advantages: first, 
it does not require the knowledge of language, as it is possible to communicate complex tasks to 
humans that don’t speak one’s language. And second, there are many tasks that are extremely difficult 
to explain in words, even if we assume perfect linguistic abilities: for example, explaining how to 
swim without demonstration and experience seems to be, at the very least, an extremely challenging 
task. 

Indeed, learning from demonstrations have had many successful applications . However, so far 
these applications have either required careful feature engineering, or a significant amount of system 
interaction time. This is far from what what we desire: ideally, we hope to demonstrate a certain task 
only once or a few times to the robot, and have it instantly generalize to new situations of the same 
task, without long system interaction time or domain knowledge about individual tasks. 

In this paper we explore the one-shot imitation learning setting illustrated in Fig. 1, where the 
objective is to maximize the expected performance of the learned policy when faced with a new, 
previously unseen, task, and having received as input only one demonstration of that task. For the 
tasks we consider, the policy is expected to achieve good performance without any additional system 
interaction, once it has received the demonstration. 

We train a policy on a broad distribution over tasks, where the number of tasks is potentially infinite. 
For each training task we assume the availability of a set of successful demonstrations. Our learned 
policy takes as input: (i) the current observation, and (ii) one demonstration that successfully solves 
a different instance of the same task (this demonstration is fixed for the duration of the episode). 
The policy outputs the current controls. We note that any pair of demonstrations for the same task 
provides a supervised training example for the neural net policy, where one demonstration is treated 
as the input, while the other as the output. 
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To make this model work, we made essential use of soft attention [6] for processing both the (poten¬ 
tially long) sequence of states and action that correspond to the demonstration, and for processing the 
components of the vector specifying the locations of the various blocks in our environment. The use 
of soft attention over both types of inputs made strong generalization possible. In particular, on a 
family of block stacking tasks, our neural network policy was able to perform well on novel block 
configurations which were not present in any training data. Videos of our experiments are available 
at http://bit.Iy/nips2017-oneshot. 

2 Related Work 

Imitation learning considers the problem of acquiring skills from observing demonstrations. Survey 
articles include [48, 11, 3]. 

Two main lines of work within imitation learning are behavioral cloning, which performs supervised 
learning from observations to actions (e.g., [41, 44]); and inverse reinforcement learning [37], where 
a reward function [1, 66, 29, 18, 22] is estimated that explains the demonstrations as (near) optimal 
behavior. While this past work has led to a wide range of impressive robotics results, it considers 
each skill separately, and having learned to imitate one skill does not accelerate learning to imitate 
the next skill. 

One-shot and few-shot learning has been studied for image recognition [61, 26, 47, 42], generative 
modeling [17, 43], and learning “fast” reinforcement learning agents with recurrent policies [16, 62]. 
Fast adaptation has also been achieved through fast-weights [5]. Like our algorithm, many of the 
aforementioned approaches are a form of meta-learning [58, 49, 36], where the algorithm itself is 
being learned. Meta-learning has also been studied to discover neural network weight optimization 
algorithms [8, 9, 23, 50, 2, 31]. This prior work on one-shot learning and meta-leaming, however, 
is tailored to respective domains (image recognition, generative models, reinforcement learning, 
optimization) and not directly applicable in the imitation learning setting. Recently, [19] propose a 
generic framework for meta learning across several aforementioned domains. However they do not 
consider the imitation learning setting. 

Reinforcement learning [56, 10] provides an alternative route to skill acquisition, by learning through 
trial and error. Reinforcement learning has had many successes, including Backgammon [57], 
helicopter control [39], Atari [35], Go [52], continuous control in simulation [51, 21, 32] and on 
real robots [40, 30]. However, reinforcement learning tends to require a large number of trials and 
requires specifying a reward function to define the task at hand. The former can be time-consuming 
and the latter can often be significantly more difficult than providing a demonstration [37]. 

Multi-task and transfer learning considers the problem of learning policies with applicability and 
re-use beyond a single task. Success stories include domain adaptation in computer vision [64, 34, 
28, 4, 15, 24, 33, 59, 14] and control [60, 45, 46, 20, 54]. However, while acquiring a multitude of 
skills faster than what it would take to acquire each of the skills independently, these approaches do 
not provide the ability to readily pick up a new skill from a single demonstration. 

Our approach heavily relies on an attention model over the demonstration and an attention model 
over the current observation. We use the soft attention model proposed in [6] for machine translations, 
and which has also been successful in image captioning [63]. The interaction networks proposed 
in [7, 12] also leverage locality of physical interaction in learning. Our model is also related to 
the sequence to sequence model [55, 13], as in both cases we consume a very long demonstration 
sequence and, effectively, emit a long sequence of actions. 

3 One Shot Imitation Learning 

3.1 Problem Formalization 

We denote a distribution of tasks by T, an individual task by t ~ T, and a distribution of demon¬ 
strations for the task t by B(£). A policy is symbolized by 7r^(a|o, d ), where a is an action, o is 
an observation, d is a demonstration, and 0 are the parameters of the policy. A demonstration 
d ~ B(£) is a sequence of observations and actions : d = [(oi, ai), ( 02 , c^), • • •, (<>r, g&t)]- We 
assume that the distribution of tasks T is given, and that we can obtain successful demonstrations for 
each task. We assume that there is some scalar-valued evaluation function Rt(d) (e.g. a binary value 
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indicating success) for each task, although this is not required during training. The objective is to 
maximize the expected performance of the policy, where the expectation is taken over tasks t E T, 
and demonstrations d G B(£). 

3.2 Block Stacking Tasks 

To clarify the problem setting, we describe a concrete example of a distribution of block stacking 
tasks, which we will also later study in the experiments. The compositional structure shared among 
these tasks allows us to investigate nontrivial generalization to unseen tasks. For each task, the goal is 
to control a 7-DOF Fetch robotic arm to stack various numbers of cube-shaped blocks into a specific 
configuration specified by the user. Each configuration consists of a list of blocks arranged into 
towers of different heights, and can be identified by a string. For example, ab cd ef gh means 
that we want to stack 4 towers, each with two blocks, and we want block A to be on top of block B, 
block C on top of block D, block E on top of block F, and block G on top of block H. Each of these 
configurations correspond to a different task. Furthermore, in each episode the starting positions 
of the blocks may vary, which requires the learned policy to generalize even within the training 
tasks. In a typical task, an observation is a list of (x,y,z) object positions relative to the gripper, 
and information if gripper is opened or closed. The number of objects may vary across different 
task instances. We define a stage as a single operation of stacking one block on top of another. For 
example, the task ab cd ef gh has 4 stages. 

3.3 Algorithm 

In order to train the neural network policy, we make use of imitation learning algorithms such 
as behavioral cloning and DAGGER [44], which only require demonstrations rather than reward 
functions to be specified. This has the potential to be more scalable, since it is often easier to 
demonstrate a task than specifying a well-shaped reward function [38]. 

We start by collecting a set of demonstrations for each task, where we add noise to the actions in order 
to have wider coverage in the trajectory space. In each training iteration, we sample a list of tasks 
(with replacement). For each sampled task, we sample a demonstration as well as a small batch of 
observation-action pairs. The policy is trained to regress against the desired actions when conditioned 
on the current observation and the demonstration, by minimizing an 1 2 or cross-entropy loss based on 
whether actions are continuous or discrete. A high-level illustration of the training procedure is given 
in Fig. 1(c). Across all experiments, we use Adamax [25] to perform the optimization with a learning 
rate of 0.001. 

4 Architecture 

While, in principle, a generic neural network could leam the mapping from demonstration and current 
observation to appropriate action, we found it important to use an appropriate architecture. Our 
architecture for learning block stacking is one of the main contributions of this paper, and we believe 
it is representative of what architectures for one-shot imitation learning could look like in the future 
when considering more complex tasks. 

Our proposed architecture consists of three modules: the demonstration network, the context network, 
and the manipulation network. An illustration of the architecture is shown in Fig. 2. We will describe 
the main operations performed in each module below, and a full specification is available in the 
Appendix. 

4.1 Demonstration Network 

The demonstration network receives a demonstration trajectory as input, and produces an embedding 
of the demonstration to be used by the policy. The size of this embedding grows linearly as a function 
of the length of the demonstration as well as the number of blocks in the environment. 

Temporal Dropout: For block stacking, the demonstrations can span hundreds to thousands of time 
steps, and training with such long sequences can be demanding in both time and memory usage. 
Hence, we randomly discard a subset of time steps during training, an operation we call temporal 
dropout , analogous to [53, 27]. We denote p as the proportion of time steps that are thrown away. 
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Figure 2: Illustration of the network architecture. 


In our experiments, we use p = 0.95, which reduces the length of demonstrations by a factor of 20. 
During test time, we can sample multiple downsampled trajectories, use each of them to compute 
downstream results, and average these results to produce an ensemble estimate. In our experience, 
this consistently improves the performance of the policy. 

Neighborhood Attention: After downsampling the demonstration, we apply a sequence of opera¬ 
tions, composed of dilated temporal convolution [65] and neighborhood attention. We now describe 
this second operation in more detail. 

Since our neural network needs to handle demonstrations with variable numbers of blocks, it must 
have modules that can process variable-dimensional inputs. Soft attention is a natural operation which 
maps variable-dimensional inputs to fixed-dimensional outputs. However, by doing so, it may lose 
information compared to its input. This is undesirable, since the amount of information contained 
in a demonstration grows as the number of blocks increases. Therefore, we need an operation that 
can map variable-dimensional inputs to outputs with comparable dimensions. Intuitively, rather than 
having a single output as a result of attending to all inputs, we have as many outputs as inputs, and 
have each output attending to all other inputs in relation to its own corresponding input. 

We start by describing the soft attention module as specified in [6]. The input to the attention includes 
a query q , a list of context vectors {cj}, and a list of memory vectors {mj}. The ith attention weight 
is given by Wi u T tanh (q + q), where v is a learned weight vector. The output of attention is a 
weighted combination of the memory content, where the weights are given by a softmax operation 
over the attention weights. Formally, we have output JT mi ^ . Note that the output has 

the same dimension as a memory vector. The attention operation can be generalized to multiple query 
heads, in which case there will be as many output vectors as there are queries. 

Now we turn to neighborhood attention. We assume there are B blocks in the environment. We 
denote the robot’s state as s ro bot, and the coordinates of each block as (aq, yi, z± ),..., (xb, Vb, %b)- 
The input to neighborhood attention is a list of embeddings h ™,..., h l g of the same dimension, 
which can be the result of a projection operation over a list of block positions, or the output of a 
previous neighborhood attention operation. Given this list of embeddings, we use two separate linear 
layers to compute a query vector and a context embedding for each block: qi Linear (/^ n ), and 
Ci Linear (h] n ). The memory content to be extracted consists of the coordinates of each block, 
concatenated with the input embedding. The ith query result is given by the following soft attention 
operation: result^ Soft Attn (query: context: {cj}^ =1 , memory: {((xj, yj,Zj), /^• n ))}j^ =1 ). 

Intuitively, this operation allows each block to query other blocks in relation to itself (e.g. find the 
closest block), and extract the queried information. The gathered results are then combined with 
each block’s own information, to produce the output embedding per block. Concretely, we have 
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output^ <— Linear(concat (/i^ n , result^, (^,^,^),s ro bot))- In practice, we use multiple query 
heads per block, so that the size of each result i will be proportional to the number of query heads. 

4.2 Context network 

The context network is the crux of our model. It processes both the current state and the embedding 
produced by the demonstration network, and outputs a context embedding, whose dimension does 
not depend on the length of the demonstration, or the number of blocks in the environment. Hence, it 
is forced to capture only the relevant information, which will be used by the manipulation network. 

Attention over demonstration: The context network starts by computing a query vector as a function 
of the current state, which is then used to attend over the different time steps in the demonstration 
embedding. The attention weights over different blocks within the same time step are summed 
together, to produce a single weight per time step. The result of this temporal attention is a vector 
whose size is proportional to the number of blocks in the environment. We then apply neighborhood 
attention to propagate the information across the embeddings of each block. This process is repeated 
multiple times, where the state is advanced using an LSTM cell with untied weights. 

Attention over current state: The previous operations produce an embedding whose size is inde¬ 
pendent of the length of the demonstration, but still dependent on the number of blocks. We then 
apply standard soft attention over the current state to produce fixed-dimensional vectors, where the 
memory content only consists of positions of each block, which, together with the robot’s state, forms 
the context embedding , which is then passed to the manipulation network. 

Intuitively, although the number of objects in the environment may vary, at each stage of the 
manipulation operation, the number of relevant objects is small and usually fixed. For the block 
stacking environment specifically, the robot should only need to pay attention to the position of the 
block it is trying to pick up (the source block), as well as the position of the block it is trying to place 
on top of (the target block). Therefore, a properly trained network can learn to match the current 
state with the corresponding stage in the demonstration, and infer the identities of the source and 
target blocks expressed as soft attention weights over different blocks, which are then used to extract 
the corresponding positions to be passed to the manipulation network. Although we do not enforce 
this interpretation in training, our experiment analysis supports this interpretation of how the learned 
policy works internally. 

4.3 Manipulation network 

The manipulation network is the simplest component. After extracting the information of the source 
and target blocks, it computes the action needed to complete the current stage of stacking one block 
on top of another one, using a simple MLP network. 1 This division of labor opens up the possibility 
of modular training: the manipulation network may be trained to complete this simple procedure, 
without knowing about demonstrations or more than two blocks present in the environment. We leave 
this possibility for future work. 

5 Experiments 

We conduct experiments with the block stacking tasks described in Section 3.2. 2 These experiments 
are designed to answer the following questions: 

• How does training with behavioral cloning compare with DAGGER? 

• How does conditioning on the entire demonstration compare to conditioning on the final 
state, even when it already has enough information to fully specify the task? 

• How does conditioning on the entire demonstration compare to conditioning on a “snapshot” 
of the trajectory, which is a small subset of frames that are most informative? 

Tn principle, one can replace this module with an RNN module. But we did not find this necessary for the 
tasks we consider. 

Additional experiment results are available in the Appendix, including a simple illustrative example of 
particle reaching tasks and further analysis of block stacking 
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• Can our framework generalize to tasks that it has never seen during training? 

To answer these questions, we compare the performance of the following architectures: 

• BC: We use the same architecture as previous, but and the policy using behavioral cloning. 

• DAGGER: We use the architecture described in the previous section, and train the policy 
using DAGGER. 

• Final state: This architecture conditions on the final state rather than on the entire demon¬ 
stration trajectory. For the block stacking task family, the final state uniquely identifies the 
task, and there is no need for additional information. However, a full trajectory, one which 
contains information about intermediate stages of the task’s solution, can make it easier to 
train the optimal policy, because it could learn to rely on the demonstration directly, without 
needing to memorize the intermediate steps into its parameters. This is related to the way in 
which reward shaping can significantly affect performance in reinforcement learning [38]. 
A comparison between the two conditioning strategies will tell us whether this hypothesis is 
valid. We train this policy using DAGGER. 

• Snapshot: This architecture conditions on a “snapshot” of the trajectory, which includes the 
last frame of each stage along the demonstration trajectory. This assumes that a segmentation 
of the demonstration into multiple stages is available at test time, which gives it an unfair 
advantage compared to the other conditioning strategies. Hence, it may perform better than 
conditioning on the full trajectory, and serves as a reference, to inform us whether the policy 
conditioned on the entire trajectory can perform as well as if the demonstration is clearly 
segmented. Again, we train this policy using DAGGER. 

We evaluate the policy on tasks seen during training, as well as tasks unseen during training. Note 
that generalization is evaluated at multiple levels: the learned policy not only needs to generalize to 
new configurations and new demonstrations of tasks seen already, but also needs to generalize to new 
tasks. 

Concretely, we collect 140 training tasks, and 43 test tasks, each with a different desired layout of the 
blocks. The number of blocks in each task can vary between 2 and 10. We collect 1000 trajectories 
per task for training, and maintain a separate set of trajectories and initial configurations to be used 
for evaluation. The trajectories are collected using a hard-coded policy. 

5.1 Performance Evaluation 


min. in 


i. 


(a) Performance on training tasks. 


(b) Performance on test tasks. 


Figure 3: Comparison of different conditioning strategies. The darkest bar shows the performance of the 
hard-coded policy, which unsurprisingly performs the best most of the time. For architectures that use temporal 
dropout, we use an ensemble of 10 different downsampled demonstrations and average the action distributions. 
Then for all architectures we use the greedy action for evaluation. 


Fig. 3 shows the performance of various architectures. Results for training and test tasks are presented 
separately, where we group tasks by the number of stages required to complete them. This is because 
tasks that require more stages to complete are typically more challenging. In fact, even our scripted 
policy frequently fails on the hardest tasks. We measure success rate per task by executing the greedy 
policy (taking the most confident action at every time step) in 100 different configurations, each 
conditioned on a different demonstration unseen during training. We report the average success rate 
over all tasks within the same group. 
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From the figure, we can observe that for the easier tasks with fewer stages, all of the different 
conditioning strategies perform equally well and almost perfectly. As the difficulty (number of stages) 
increases, however, conditioning on the entire demonstration starts to outperform conditioning on the 
final state. One possible explanation is that when conditioned only on the final state, the policy may 
struggle about which block it should stack first, a piece of information that is readily accessible from 
demonstration, which not only communicates the task, but also provides valuable information to help 
accomplish it. 

More surprisingly, conditioning on the entire demonstration also seems to outperform conditioning 
on the snapshot, which we originally expected to perform the best. We suspect that this is due 
to the regularization effect introduced by temporal dropout, which effectively augments the set of 
demonstrations seen by the policy during training. 

Another interesting finding was that training with behavioral cloning has the same level of performance 
as training with DAGGER, which suggests that the entire training procedure could work without 
requiring interactive supervision. In our preliminary experiments, we found that injecting noise into 
the trajectory collection process was important for behavioral cloning to work well, hence in all 
experiments reported here we use noise injection. In practice, such noise can come from natural 
human-induced noise through tele-operation, or by artificially injecting additional noise before 
applying it on the physical robot. 

5.2 Visualization 

We visualize the attention mechanisms underlying the main policy architecture to have a better 
understanding about how it operates. There are two kinds of attention we are mainly interested in, 
one where the policy attends to different time steps in the demonstration, and the other where the 
policy attends to different blocks in the current state. Fig. 4 shows some of the attention heatmaps. 



Block ID Time step (downsampled demonstration) 


(a) Attention over blocks in the current state. (b) Attention over downsampled demonstration. 

Figure 4: Visualizing attentions performed by the policy during an entire execution. The task 
being performed is ab cde f g hi j. Note that the policy has multiple query heads for each type 
of attention, and only one query head per type is visualized, (a) We can observe that the policy 
almost always focuses on a small subset of the block positions in the current state, which allows the 
manipulation network to generalize to operations over different blocks, (b) We can observe a sparse 
pattern of time steps that have high attention weights. This suggests that the policy has essentially 
learned to segment the demonstrations, and only attend to important key frames. Note that there are 
roughly 6 regions of high attention weights, which nicely corresponds to the 6 stages required to 
complete the task. 


6 Conclusions 

In this work, we presented a simple model that maps a single successful demonstration of a task to 
an effective policy that solves said task in a new situation. We demonstrated effectiveness of this 
approach on a family of block stacking tasks. There are a lot of exciting directions for future work. 
We plan to extend the framework to demonstrations in the form of image data, which will allow 
more end-to-end learning without requiring a separate perception module. We are also interested in 
enabling the policy to condition on multiple demonstrations, in case where one demonstration does 
not fully resolve ambiguity in the objective. Furthermore and most importantly, we hope to scale up 
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our method on a much larger and broader distribution of tasks, and explore its potential towards a 
general robotics imitation learning system that would be able to achieve an overwhelming variety of 
tasks. 
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A Illustrative Example: Particle Reaching 


The particle reaching problem is a very simple family of tasks. In each task, we control a point 
robot to reach a specific landmark, and different tasks are identified by different landmarks. As 
illustrated in Fig. 1, one task could be to reach the orange square, and another task could be to reach 
the green triangle. The agent receives its own 2D location, as well as the 2D locations of each of 
the landmarks. Within each task, the initial position of the agent, as well as the positions of all the 
landmarks, can vary across different instances of the task. 

Without a demonstration, the robot does not know which landmark it should reach, and will not be 
able to accomplish the task. Hence, this setting already gets at the essence of one-shot imitation, 
namely to communicate the task via a demonstration. After learning, the agent should be able to 
identify the target landmark from the demonstration, and reach the same landmark in a new instance 
of the task. 


'.O 




Figure 1: The robot is a point mass controlled with 2-dimensional force. The family of tasks is to reach a 
target landmark. The identity of the landmark differs from task to task, and the model has to figure out which 
target to pursue based on the demonstration, (left) illustration of the robot; (middle) the task is to reach the 
orange box, (right) the task is to reach the green triangle. 

We consider three architectures for this problem: 

• Plain LSTM: The first architecture is a simple LSTM with 512 hidden units. It reads the 
demonstration trajectory, the output of which is then concatenated with the current state, 
and fed to a multi-layer perceptron (MLP) to produce the action. 

• LSTM with attention: In this architecture, the LSTM outputs a weighting over the dif¬ 
ferent landmarks from the demonstration sequence. Then, it applies this weighting in the 
test scene, and produces a weighted combination over landmark positions given the current 
state. This 2D output is then concatenated with the current agent position, and fed to an 
MLP to produce the action. 

• Final state with attention: Rather than looking at the entire demonstration trajectory, this 
architecture only looks at the final state in the demonstration (which is already sufficient to 
communicate the task), and produce a weighting over landmarks. It then proceeds like the 
previous architecture. 

Notice that these three architectures are increasingly more specialized to the specific particle reach¬ 
ing setting, which suggests a potential trade-off between expressiveness and generalizability. 

The experiment results are shown in Fig. 2. We observe that as the architecture becomes more 
specialized, we achieve much better generalization performance. For this simple task, it appears 
that conditioning on the entire demonstration hurts generalization performance, and conditioning on 
just the final state performs the best even without explicit regularization. This makes intuitive sense, 
since the final state already sufficiently characterizes the task at hand. 

However, the same conclusion does not appear to hold as the task becomes more complicated, as 
shown by the block stacking tasks in the main text. 

Fig. 3 shows the learning curves for the three architectures designed for the particle reaching tasks, 
as the number of landmarks is varied, by running the policies over 100 different configurations, and 
computing success rates over both training and test data. We can clearly observe that both LSTM- 
based architectures exhibit overfitting as the number of landmarks increases. On the other hand, 
using attention clearly improves generalization performance, and when conditioning on only the 
final state, it achieves perfect generalization in all scenarios. It is also interesting to observe that 
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Figure 2: Success rates of different architectures for particle reaching. The “Train” curves show the success 
rates when conditioned on demonstrations seen during training, and running the policy on initial conditions 
seen during training, while the “Test” curves show the success rates when conditioned on new trajectories and 
operating in new situations. Both attention-based architectures achieve perfect training success rates, and the 
curves are overlapped. 


learning undergoes a phase transition. Intuitively, this may be when the network is learning to infer 
the task from the demonstration. Once this is finished, the learning of control policy is almost trivial. 

Table 1 and Table 2 show the exact performance numbers for reference. 


#Landmarks 

Plain LSTM 

LSTM with attention 

Final state with attention 

2 

100 . 0 % 

100 . 0 % 

100 . 0 % 

3 

100 . 0 % 

100 . 0 % 

100 . 0 % 

4 

100 . 0 % 

100 . 0 % 

100 . 0 % 

5 

100 . 0 % 

100 . 0 % 

100 . 0 % 

6 

99.0% 

100 . 0 % 

100 . 0 % 

7 

100 . 0 % 

100 . 0 % 

100 . 0 % 

8 

100 . 0 % 

100 . 0 % 

100 . 0 % 

9 

100 . 0 % 

100 . 0 % 

100 . 0 % 

10 

91.9% 

100 . 0 % 

100 . 0 % 


Table 1: Success rates of particle reaching conditioned on seen demonstrations, and running on seen 
initial configurations. 


B Further Details on Block Stacking 

B.l Full Description of Architecture 

We now specify the architecture in pseudocode. We omit implementation details which involve han¬ 
dling a minibatch of demonstrations and observation-action pairs, as well as necessary padding and 
masking to handle data of different dimensions. We use weight normalization with data-dependent 
initialization Salimans and Kingma [2016] for all dense and convolution operations. 
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(a) Plain LSTM (Train) 



(c) LSTM with attention (Train) 



(b) Plain LSTM (Test) 
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(d) LSTM with attention (Test) 




(e) Final state with attention (Train) (f) Final state with attention (Test) 

Figure 3: Learning curves for particle reaching tasks. Shown success rates are moving averages of 
past 10 epochs for smoother curves. Each policy is trained for up to 1000 epochs, which takes up to 
an hour using a Titan X Pascal GPU (as can be seen from the plot, most experiments can be finished 
sooner). 


B.1.1 Demonstration Network 

Assume that the demonstration has T time steps and we have B blocks. Our architecture only 
make use of the observations in the input demonstration but not the actions. Each observation is 
a (3 B + 2)-dimensional vector, containing the (x,y,z) coordinates of each block relative to the 
current position of the gripper, as well as a 2-dimensional gripper state indicating whether it is open 
or closed. 

The full sequence of operations is given in Module 1. We first apply temporal dropout as described 
in the main text. Then we split the observation into information about the block and information 
about the robot, where the first dimension is time and the second dimension is the block ID. The 
robot state is broadcasted across different blocks. Hence the shape of outputs should be T x B x 3 
and T x B x 2, respectively. 

Then, we perform a 1 x 1 convolution over the block states to project them to the same dimension 
as the per-block embedding. Then we perform a sequence of neighborhood attention operations and 
lxl convolutions, where the input to the convolution is the concatenation of the attention result, 
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Module 1 Demonstration Network _ 

Input: Demonstration d £ M Tx ( 3B + 2 ) 

Hyperparameters: p = 0.95, D = 64 

Output: Demonstration embedding £ M TxBxD , where T = \T( 1 — p)] is the length of the down- 
sampled trajectory. 

d' £- TemporalDropout(d, probability=p) 
block_state, robot_state A- Split (d' ) 
h ConvlD (block_state, kernel_size=l , channels=D) 
for a £ {1, 2,4, 8} do 
// Residual connections 
h' A- ReLU (h) 

attn_result A- NeighborhoodAttention (h' ) 
h' A- Concat ({h' , block_state, robot_state}, axis=-l) 
h' A- ConvlD(h ', kernel_size=2 , channels=D, dilation=a) 
h' A- ReLU (h' ) 
h £- h + h' 
end for 

demo_embedding A- h 


the current block position, and the robot’s state. This allows each block to query the state of other 
blocks, and reason about the query result in comparison with its own state and the robot’s state. We 
use residual connections during this procedure. 

B.1.2 Context Network 

The pseudocode is shown in Module 2. We perform a series of attention operations over the demon¬ 
stration, followed by attention over the current state, and we apply them repeatedly through an 
LSTM with different weights per time step (we found this to be slightly easier to optimize). Then, 
in the end we apply a final attention operation which produces a fixed-dimensional embedding in¬ 
dependent of the length of the demonstration or the number of blocks in the environment. 

B.1.3 Manipulation Network 

Given the context embedding, this module is simply a multilayer perceptron. Pseudocode is given 
in Module 3. 

B.2 Evaluating Permutation Invariance 

During training and in the previous evaluations, we only select one task per equivalence class, where 
two tasks are considered equivalent if they are the same up to permuting different blocks. This is 
based on the assumption that our architecture is invariant to permutations among different blocks. 
For example, if the policy is only trained on the task abed, it should perform well on task deba, 
given a single demonstration of the task deb a. We now experimentally verify this property by 
fixing a training task, and evaluating the policy’s performance under all equivalent permutations of 
it. As Fig. 4 shows, although the policy has only seen the task abed, it achieves the same level of 
performance on all other equivalent tasks. 

B.3 Effect of Ensembling 

We now evaluate the importance of sampling multiple downsampled demonstrations during evalu¬ 
ation. Fig. 5 shows the performance across all training and test tasks, as the number of ensembles 
varies from 1 to 20. We observe that more ensembles helps the most for tasks with fewer stages. 
On the other hand, it consistently improves performance for the harder tasks, although the gap is 
smaller. We suspect that this is because the policy has learned to attend to frames in the demonstra¬ 
tion trajectory where the blocks are already stacked together. In tasks with only 1 stage, for example, 
it is very easy for these frames to be dropped in a single downsampled demonstration. On the other 
hand, in tasks with more stages, it becomes more resilient to missing frames. Using more than 10 
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Module 2 Context Network _ 

Input: Demonstration embedding hi n £ R TxBxD , current state s £ M 3B+2 
Hyperparameters: D = 64, ti st m = 4, H = 2 
Output: Context embedding £ M 2+6f/ 

// Split the current state into block state £ R BxS and robot state broadcasted to all blocks £ R Bx2 
block_state, robot_state <— SplitSingle (s) 

// Initialize LSTM output £ R BxD and state (including hidden and cell state) £ M Bx2D 
output, state <— InitLSTMState(size=B, hidden_dim=D) 
for t = 1 to tistm do 

// Temporal attention: every block attend to the same time step 
x £- output 

if t > 1 then 

x <— ReLU (x) 

end if 

// Computing query for attention over demonstration £ R B x D 
q £- Dense (x, output_dim=D) 

// Compute result from attention £ ]^ HxBxD 

temp <— SoftAttention (query=q, context =h_in, memory=h_in, num_heads=H) 

// Reorganize result into shape B x (HD) 

temp Reshape(Transpose(temp, (1, 0, 2)), (B, H*D)) 

// Spatial attention: each block attend to a different block separately 
x <— output 

if t > 1 then 

x <— ReLU (x) 

end if 

x A- Concat ({x, temp}, axis=-l) 

// Computing context for attention over current state £ R BxD 
ctx £ Dense (x, output_dim=D) 

// Computing query for attention over current state £ R B x D 
q <— Dense (x, output_dim=D) 

// Computing memory for attention over current state £ R Bx ( HD + 3 ) 
mem Concat ({block_state, temp}, axis=-l) 

// Compute result from attention £ ]$i BxHx ( HD +3) 

spatial SoftAttention (query=q, context=ctx, memory=mem, num_heads=H) 

// Reorganize result into shape B x H(HD + 3) 

spatial Reshape (spatial, (B, H* (H*D + 3) ) ) 

// Form input to the LSTM cell £ ]\i Bx ( h ( hd + 3 )+h d +8) 

input Concat ({robot_state, block_state, spatial, temp}, axis=-l) 

// Run one step of an LSTM with untied weights (meaning that we use different weights per 
time step 

output, state LSTMOneStep(input=input, state=state) 

end for 

// Final attention over the current state, compressing an 0(B) representation down to 0(1) 

// Compute the query vector. We use a fixed, trainable query vector independent of the input data, 
with size £ M 2xD (we use two queries, originally intended to have one for the source block and 
one for the target block) 

q GetFixedQuery() 

// Get attention result, which should be of shape 2 x H x 3 

r SoftAttention (query=q, context=output, memory=block_state, num_heads=H) 

// Form the final context embedding (we pick the first robot state since no need to broadcast here) 

context_embedding Concat ({robot_state [ 0 ] , Reshape (r, 2*H*3)}) 
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Module 3 Manipulation Network 

Input: Context embedding h{ n G R 2+6f/ 
Hyperparameters: H = 2 

Output: Predicted action distribution G 

h G- ReLU (Dense (h_in, output_dim=256)) 
h <— ReLU (Dense (h, output_dim=256)) 
action_dist <— Dense (h, output_dim= | A | ) 


80 % 
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/ # / / ? / # ? f ^ J J f # 
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Figure 4: Performance of policy on a set of tasks equivalent up to permutations. 


ensembles appears to provide no significant improvements, and hence we used 10 ensembles in our 
main evaluation. 
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Figure 5: Performance of various number of ensembles. 
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B.4 Breakdown of Failure Cases 


To understand the limitations of the current approach, we perform a breakdown analysis of the failure 
cases. We consider three failure scenarios: “Wrong move” means that the policy has arranged a 
layout incompatible with the desired layout. This could be because the policy has misinterpreted the 
demonstration, or due to an accidental bad move that happens to scramble the blocks into the wrong 
layout. “Manipulation failure” means that the policy has made an irrecoverable failure, for example 
if the block is shaken off the table, which the current hard-coded policy does not know how to handle. 
“Recoverable failure” means that the policy runs out of time before finishing the task, which may be 
due to an accidental failure during the operation that would have been recoverable given more time. 
As shown in Fig. 6, conditioning on only the final state makes more wrong moves compared to other 
architectures. Apart from that, most of the failure cases are actually due to manipulation failures 
that are mostly irrecoverable. 1 This suggests that better manipulation skills need to be acquired to 
make the learned one-shot policy more reliable. 



Figure 6: Breakdown of the success and failure scenarios. The area that each color occupies represent the ratio 
of the corresponding scenario. 


B.5 Learning Curves 

Fig. 7 shows the learning curves for different architectures designed for the block stacking tasks. 
These learning curves do not reflect final performance: for each evaluation point, we sample tasks 
and demonstrations from training data, reset the environment to the starting point of some particular 
stage (so that some blocks are already stacked), and only run the policy for up to one stage. If the 
training algorithm is DAGGER, these sampled trajectories are annotated and added to the training 
set. Hence this evaluation does not evaluate generalization. We did not perform full evaluation as 
training proceeds, because it is very time consuming: each evaluation requires tens of thousands of 

'Note that the actual ratio of misinterpreted demonstrations may be different, since the runs that have caused 
a manipulation failure could later lead to a wrong move, were it successfully executed. On the other hand, by 
visually inspecting the videos, we observed that most of the trajectories categorized as “Wrong Move” are 
actually due to manipulation failures (except for policy conditioning on the final state, which does seem to 
occasionally execute an actual wrong move). 
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trajectories across over >100 tasks. However, these figures are still useful to reflect some relative 
trend. 

From these figures, we can observe that while conditioning on full trajectories gives the best per¬ 
formance which was shown in the main text, it requires much longer training time, simply because 
conditioning on the entire demonstration requires more computation. In addition, this may also be 
due to the high variance of the training process due to downsampling demonstrations, as well as 
the fact that the network needs to learn to properly segment the demonstration. It is also interesting 
that conditioning on snapshots seems to learn faster than conditioning on just the final state, which 
again suggests that conditioning on intermediate information is helpful, not only for the final policy, 
but also to facilitate training. We also observe that learning happens most rapidly for the initial 
stages, and much slower for the later stages, since manipulation becomes more challenging in the 
later stages. In addition, there are fewer tasks with more stages, and hence the later stages are not 
sampled as frequently as the earlier stages during evaluation. 

B.6 Exact Performance Numbers 

Exact performance numbers are presented for reference: 

• Table 3 and Table 4 show the success rates of different architectures on training and test 
tasks, respectively; 

• Table 5 shows the success rates across all tasks as the number of ensembles is varied; 

• Table 6 shows the success rates of tasks that are equivalent to abed up to permutations; 

• Table 7, Table 8, Table 9, Table 10, and Table 11 show the breakdown of different success 
and failure scenarios for all considered architectures. 


B.7 More Visualizations 

Fig. 8 and Fig. 9 show the full set of heatmaps of attention weights. Interestingly, in Fig. 8, we 
observe that rather than attending to two blocks at a time, as we originally expected, the policy has 
learned to mostly attend to only one block at a time. This makes sense because during each of the 
grasping and the placing phase of a single stacking operation, the policy needs to only pay attention 
to the single block that the gripper should aim towards. For context, Fig. 10 and Fig. 11 show key 
frames of the neural network policy executing the task. 


References 
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#Landmarks 

Plain LSTM 

LSTM with attention 

Final state with attention 

2 

100 . 0 % 

100 . 0 % 

100 . 0 % 

3 

100 . 0 % 

100 . 0 % 

100 . 0 % 

4 

99.0% 

100 . 0 % 

100 . 0 % 

5 

98.0% 

100 . 0 % 

100 . 0 % 

6 

99.0% 

100 . 0 % 

100 . 0 % 

7 

98.0% 

100 . 0 % 

100 . 0 % 

8 

93.9% 

99.0% 

100 . 0 % 

9 

83.8% 

94.9% 

100 . 0 % 

10 

50.5% 

85.9% 

100 . 0 % 


Table 2: Success rates of particle reaching conditioned on unseen demonstrations, and running on 
unseen initial configurations. 


#Stages 

Demo 

DAGGER 

BC 

Snapshot 

Final state 

1 

99 . 1 % 

99 . 1 % 

99 . 1 % 

97.2% 

98.8% 

2 

95 . 6 % 

94 . 3 % 

93.7% 

92.6% 

86.7% 

3 

88 . 5 % 

88 . 0 % 

86.9% 

86.7% 

84.8% 

4 

78 . 6 % 

78 . 2 % 

76.7% 

76.4% 

71.9% 

5 

67 . 3 % 

65 . 9 % 

65.4% 

62.5% 

60.6% 

6 

55 . 7 % 

51.5% 

52 . 4 % 

47.0% 

43.6% 

7 

42 . 8 % 

34.3% 

37 . 5 % 

31.4% 

31.5% 


Table 3: Success rates of different architectures on training tasks of block stacking. 


#Stages 

Demo 

DAGGER 

BC 

Snapshot 

Final state 

2 

95.8% 

949% ' 

95 . 9 % 

92.8% 

941% 

4 

77 . 6 % 

77.0% 

74.8% 

77 . 2 % 

75.8% 

5 

65 . 9 % 

65 . 9 % 

64.3% 

61.1% 

51.9% 

6 

49.4% 

50 . 6 % 

46.5% 

42.6% 

35.9% 

7 

46 . 5 % 

36.5% 

38 . 5 % 

32.8% 

32.0% 

8 

29 . 0 % 

18.0% 

24 . 0 % 

19.0% 

20.0% 


Table 4: Success rates of different architectures on test tasks of block stacking. 


#Stages 

1 Ens. 

2 Ens. 

5 Ens. 

10 Ens. 

20 Ens. 

1 

91.9% 

95.4% 

98.8% 

99 . 1 % 

98.7% 

2 

92.3% 

92.2% 

94.5% 

94 . 6 % 

94.1% 

3 

86.0% 

86.8% 

87.9% 

88 . 0 % 

87.9% 

4 

76.6% 

77.4% 

77.9% 

78.0% 

78 . 3 % 

5 

65.1% 

65.0% 

65.3% 

65 . 9 % 

65.5% 

6 

49.0% 

50.4% 

50.1% 

51 . 3 % 

50.8% 

7 

34.4% 

36.1% 

36.0% 

34.9% 

36 . 8 % 

8 

20.0% 

21 . 0 % 

21 . 0 % 

18.0% 

20.0% 


Table 5: Success rates of varying number of ensembles using the DAGGER policy conditioned on 
full trajectories, across both training and test tasks. 
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Task ID 

Success Rate 

abed 

83.0% 

abdc 

86.0% 

acbd 

92.0% 

aedb 

84.0% 

adbc 

91.0% 

adeb 

88.0% 

bacd 

92.0% 

bade 

90.0% 

bead 

92.0% 

beda 

88.0% 

bdac 

94.0% 

bdea 

88.0% 

cabd 

82.0% 

cadb 

87.0% 

ebad 

95.0% 

cbda 

87.0% 

edab 

91.0% 

cdba 

93.0% 

dabc 

90.0% 

dacb 

92.0% 

dbac 

88.0% 

dbca 

90.0% 

dcab 

91.0% 

deba 

84.0% 


Table 6: Success rates of a set of tasks that are equivalent up to permutations, using the DAGGER 
policy conditioned on full trajectories. 


#Stages 

Success 

Recoverable failure 

Manipulation failure 

Wrong move 

1 

99.3% 

0.0% 

0.7% 

0.0% 

2 

95.9% 

0.4% 

3.7% 

0.0% 

3 

89.1% 

0.7% 

10.1% 

0.1% 

4 

79.2% 

1.2% 

19.4% 

0.1% 

5 

67.5% 

1.4% 

30.9% 

0.2% 

6 

55.2% 

1.4% 

43.1% 

0.3% 

7 

44.6% 

1.7% 

53.2% 

0.6% 

8 

30.9% 

4.3% 

64.9% 

0.0% 


Table 7: Breakdown of success and failure scenarios for Demo policy. 


#Stages 

Success 

Recoverable failure 

Manipulation failure 

Wrong move 

1 

99.4% 

M% 

0.6% 

M% 

2 

95.3% 

0.9% 

3.8% 

0.0% 

3 

89.1% 

1.9% 

8.8% 

0.1% 

4 

79.5% 

3.5% 

16.7% 

0.3% 

5 

69.1% 

5.0% 

25.6% 

0.3% 

6 

55.8% 

7.3% 

36.4% 

0.5% 

7 

39.0% 

8.6% 

51.5% 

0.8% 

8 

21.2% 

14.1% 

62.4% 

2.4% 


Table 8: Breakdown of success and failure scenarios for DAGGER policy. 
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(c) Stage 1 (d) Stage 2 



(e) Stage 3 



(f) Stage 4 



(g) Stage 5 


100% ------ Po l ity Typ e 



Figure 7: Learning curves of block stacking task. The first plot shows the average success rates over 
initial configurations of all stages. The subsequent figures shows the breakdown of each stage. For 
instance, “Stage 3” means that the first 3 stacking operations are already completed, and the policy 
is evaluated on its ability to perform the 4th stacking operation. 
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#Stages 

Success 

Recoverable failure 

Manipulation failure 

Wrong move 

1 

99.6% 

0 Wo 

UWo 

0 Wo 

2 

95.6% 

1.1% 

3.2% 

0.1% 

3 

88.1% 

2.2% 

9.5% 

0.2% 

4 

78.5% 

4.5% 

16.8% 

0.2% 

5 

67.2% 

6.6% 

25.7% 

0.4% 

6 

53.9% 

8.3% 

37.1% 

0.6% 

7 

40.6% 

9.8% 

48.7% 

0.9% 

8 

27.0% 

13.5% 

58.4% 

1.1% 


Table 9: Breakdown of success and failure scenarios for BC policy. 


#Stages 

Success 

Recoverable failure 

Manipulation failure 

Wrong move 

1 

99.1% 

0 Wo 

OWo 

0 Wo 

2 

94.5% 

1.6% 

3.8% 

0.1% 

3 

88.0% 

2.5% 

9.3% 

0.2% 

4 

78.9% 

4.6% 

16.2% 

0.3% 

5 

65.6% 

8.0% 

25.8% 

0.6% 

6 

50.8% 

8.3% 

40.2% 

0.7% 

7 

36.1% 

9.2% 

54.2% 

0.4% 

8 

21.6% 

11.4% 

65.9% 

1.1% 


Table 10: Breakdown of success and failure scenarios for Snapshot policy. 


#Stages 

Success 

Recoverable failure 

Manipulation failure 

Wrong move 

1 

99.2% 

M% 

o% 

M% 

2 

95.1% 

1.3% 

3.6% 

0.0% 

3 

86.7% 

2.5% 

9.7% 

1.1% 

4 

75.2% 

4.0% 

18.3% 

2.5% 

5 

60.5% 

4.3% 

31.2% 

4.0% 

6 

45.5% 

4.7% 

45.5% 

4.3% 

7 

34.9% 

5.6% 

57.3% 

2.2% 

8 

24.1% 

3.6% 

72.3% 

0.0% 


Table 11: Breakdown of success and failure scenarios for Final state policy. 



(a) Head 0 



(b) Head 1 




(d) Head 3 


Figure 8: Heatmap of attention weights over different blocks of all 4 query heads. 
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Figure 9: Heatmap of attention weights over downsampled demonstration trajectory of all 6 query 
heads. There are 2 query heads per step of LSTM, and 3 steps of LSTM are performed. 
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Figure 10: Illustration of the task used for the visualization of attention heatmaps (first half). The 
task is ab cde f g hi j. The left side shows the key frames in the demonstration. The right side 
shows how, after seeing the entire demonstration, tthe policy reproduces the same layout in a new 
initialization of the same task. 
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Figure 11: Illustration of the task used for the visualization of attention heatmaps (second half). The 
task is ab cde f g hi j. The left side shows the key frames in the demonstration. The right side 
shows how, after seeing the entire demonstration, tthe policy reproduces the same layout in a new 
initialization of the same task. 
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